Appendix A Variational Paragraph Embedder
A.1 Selection of substitution rate p
Figure 4: Impact of the proportion of injected noise for learning Paragraph Embeddings on the XSum dataset. (Figure 4). The results of the ablation study are presented in Table 5. Embedder in providing clean and denoised reconstructions. In general, it has been observed that generations progress in a coarse-to-fine manner. The early time step, which is close to 1, tends to be less fluent and generic. This was the nicest stay we have ever had. Turtle Bay was a great resort. This was the nicest stay we have ever had.
Insulin Resistance Prediction From Wearables and Routine Blood Biomarkers
Metwally, Ahmed A., Heydari, A. Ali, McDuff, Daniel, Solot, Alexandru, Esmaeilpour, Zeinab, Faranesh, Anthony Z, Zhou, Menglian, Savage, David B., Heneghan, Conor, Patel, Shwetak, Speed, Cathy, Prieto, Javier L.
Insulin resistance, a precursor to type 2 diabetes, is characterized by impaired insulin action in tissues. Current methods for measuring insulin resistance, while effective, are expensive and not widely available, which hinders opportunities for early intervention. In this study, we remotely recruited the largest dataset to date across the US to study insulin resistance (N=1,165 participants, with median BMI=28 kg/m², age=45 years, HbA1c=5.4%), incorporating wearable device time series data and blood biomarkers, including the ground-truth measure of insulin resistance, homeostatic model assessment for insulin resistance (HOMA-IR). We developed deep neural network models to predict insulin resistance based on readily available digital and blood biomarkers. Our results show that our models can predict insulin resistance by combining both wearable data and readily available blood biomarkers better than either of the two data sources separately (R²=0.5, auROC=0.80, sensitivity=76%, and specificity=84%). The model showed 93% sensitivity and 95% adjusted specificity in obese and sedentary participants, a subpopulation most vulnerable to developing type 2 diabetes and who could benefit most from early intervention. Rigorous evaluation of model performance, including interpretability and robustness, facilitates generalizability across larger cohorts, which is demonstrated by reproducing the prediction performance on an independent validation cohort (N=72 participants). Additionally, we demonstrated how the predicted insulin resistance can be integrated into a large language model agent to help understand and contextualize HOMA-IR values, facilitating interpretation and safe personalized recommendations. This work offers the potential for early detection of people at risk of type 2 diabetes, thereby facilitating earlier implementation of preventative strategies.
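For readers unfamiliar with the quantities named in this abstract, the following is a minimal sketch of how HOMA-IR and the reported sensitivity and specificity are conventionally computed. It uses the standard HOMA-IR approximation (fasting glucose in mg/dL times fasting insulin in µU/mL, divided by 405); the abstract does not state which variant or which classification cutoff the authors used, so the function names and the example values below are illustrative assumptions, not the paper's pipeline.

```python
from typing import Sequence, Tuple


def homa_ir(fasting_glucose_mg_dl: float, fasting_insulin_uU_ml: float) -> float:
    """Standard HOMA-IR approximation: glucose (mg/dL) * insulin (uU/mL) / 405."""
    if fasting_glucose_mg_dl <= 0 or fasting_insulin_uU_ml <= 0:
        raise ValueError("glucose and insulin must be positive")
    return fasting_glucose_mg_dl * fasting_insulin_uU_ml / 405.0


def sensitivity_specificity(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical fasting values, chosen only to make the arithmetic easy to follow.
    print(homa_ir(90.0, 9.0))
    # Hypothetical labels: True means "insulin resistant".
    print(sensitivity_specificity([True, True, False, False],
                                  [True, False, False, False]))
```

Sensitivity is the fraction of truly insulin-resistant participants the model flags, and specificity is the fraction of non-resistant participants it correctly clears, which is why the abstract reports both.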
Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction
Xu, Xiaoran, Xue, Zhaoqian, Zhang, Chi, Medri, Jhonatan, Xiong, Junjie, Zhou, Jiayan, Jin, Jin, Zhang, Yongfeng, Ma, Siyuan, Li, Lingyao
Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group (CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below-poverty rate, no-insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.
Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia
Göpfert, Jan, Kuckertz, Patrick, Weinand, Jann M., Stolten, Detlef
To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38 738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline approaches to measurement extraction, where quantities are first identified and then their measurement context. To allow reproduction of this work using newer or different versions of Wikipedia and Wikidata, we publish the code used to create the datasets along with the data.
Passive Heart Rate Monitoring During Smartphone Use in Everyday Life
Liao, Shun, Di Achille, Paolo, Wu, Jiang, Borac, Silviu, Wang, Jonathan, Liu, Xin, Teasley, Eric, Cai, Lawrence, Liu, Yun, McDuff, Daniel, Su, Hao-Wei, Winslow, Brent, Pathak, Anupam, Patel, Shwetak, Taylor, James A., Rogers, Jameson K., Poh, Ming-Zher
Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurements during everyday smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions, representing the largest validation study of its kind. Compared to reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) < 10% for HR measurements across three skin tone groups of light, medium and dark pigmentation; MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error < 5 bpm compared to a wearable HR tracker, and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring.
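The two error metrics quoted in this abstract can be made concrete with a short sketch. The definitions below are the conventional ones for mean absolute percentage error (MAPE) and mean absolute error (MAE); the abstract does not spell out how the authors aggregate per-video or per-day errors, so the function names and sample readings are illustrative assumptions.

```python
from typing import Sequence


def mape(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent, relative to the reference."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(r == 0 for r in reference):
        raise ValueError("reference values must be non-zero")
    total = sum(abs(e - r) / abs(r) for r, e in zip(reference, estimate))
    return 100.0 * total / len(reference)


def mae(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean absolute error, in the same unit as the inputs (e.g. bpm)."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - r) for r, e in zip(reference, estimate)) / len(reference)


if __name__ == "__main__":
    # Hypothetical readings in beats per minute: reference ECG vs. camera estimate.
    ecg_bpm = [60.0, 80.0]
    camera_bpm = [66.0, 76.0]
    print(f"MAPE: {mape(ecg_bpm, camera_bpm):.2f}%")
    print(f"MAE:  {mae(ecg_bpm, camera_bpm):.2f} bpm")
```

MAPE normalizes each error by the reference heart rate, so it is comparable across low and high heart rates, while MAE stays in bpm and is easier to read for resting heart rate.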
Automatic Input Rewriting Improves Translation with Large Language Models
Can we improve machine translation (MT) with LLMs by rewriting their inputs automatically? Users commonly rely on the intuition that well-written text is easier to translate when using off-the-shelf MT systems. LLMs can rewrite text in many ways but in the context of MT, these capabilities have been primarily exploited to rewrite outputs via post-editing. We present an empirical study of 21 input rewriting methods with 3 open-weight LLMs for translating from English into 6 target languages. We show that text simplification is the most effective MT-agnostic rewrite strategy and that it can be improved further when using quality estimation to assess translatability. Human evaluation further confirms that simplified rewrites and their MT outputs both largely preserve the original meaning of the source and MT. These results suggest LLM-assisted input rewriting as a promising direction for improving translations.
Beyond Trusting Trust: Multi-Model Validation for Robust Code Generation
UMBC CODEBOT '25 Workshop, Columbia, MD, 25-26 February 2025. Bradley McDanel, Franklin and Marshall College, bmcdanel@fandm.edu
1 Introduction
Ken Thompson's 1984 essay "Reflections on Trusting Trust" demonstrated that even carefully reviewed source code could hide malicious behavior through compromised compilers, because the malicious code exists only in the compiled binary form, not its source [1]. Today, large language models (LLMs) used as code generators [2, 3] present an even more opaque security challenge than classical compilers. While compiler binaries can be analyzed for malicious behavior, LLMs operate through vast matrices of weights combined in non-linear ways, making it difficult to develop robust methods for identifying embedded behaviors [4, 5]. This paper revisits Thompson's analogy in the context of LLM-based code generation. We show how malicious behavior might be subtly embedded into a widely used model and argue that direct inspection of the model's parameters is currently infeasible.
RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
Yang, Ziqi, Lu, Yuxuan, Bagdasarian, Jennifer, Swain, Vedant Das, Agarwal, Ritu, Campbell, Collin, Al-Refaire, Waddah, El-Bayoumi, Jehan, Gao, Guodong, Wang, Dakuo, Yao, Bingsheng, Shara, Nawar
Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group of cancers that account for more than 35% of cancer-related deaths worldwide, but postoperative complications are unpredictable and can be life-threatening. In this paper, we investigate how recent advancements in large language models (LLMs) can benefit remote patient monitoring (RPM) systems through clinical integration by designing RECOVER, an LLM-powered RPM system for postoperative GI cancer care. To closely engage stakeholders in the design process, we first conducted seven participatory design sessions with five clinical staff and interviewed five cancer patients to derive six major design strategies for integrating clinical guidelines and information needs into LLM-based RPM systems. We then designed and implemented RECOVER, which features an LLM-powered conversational agent for cancer patients and an interactive dashboard for clinical staff to enable efficient postoperative RPM. Finally, we used RECOVER as a pilot system to assess the implementation of our design strategies with four clinical staff and five patients, providing design implications by identifying crucial design elements, offering insights on responsible AI, and outlining opportunities for future LLM-powered RPM systems.